Many of the figures and analysis methodology were based off of this tutorial from NYU. https://learn.gencore.bio.nyu.edu/rna-seq-analysis/gene-set-enrichment-analysis/
Information received from the csv samplesheet:
Validating input parameters…
## ✓ Sample 1 ('chen_phillips') parameters validated
All parameters validated successfully!
Metadata information received from the tsv table:
## replicate strain
## 1 rep1 hrde1
## 2 rep2 hrde1
## 3 rep3 hrde1
## 4 rep1 hrde2
## 5 rep2 hrde2
## 6 rep3 hrde2
## 7 rep1 wt
## 8 rep2 wt
## 9 rep3 wt
Here, count data from the RNA-seq experiment is read in the form of a counts matrix. Each column holds data from one sample, and each row represents a gene, such that the i-th row and n-th column tells how reads of gene i were measured in sample n. The values received should be un-normalized counts of sequencing reads or fragments.
# # Read Count Matrix
counts_file <- trimws(sample_params$salmon_merged_gene_counts_file_path)
count.matrix <- read_tsv(counts_file)
count.matrix <- count.matrix %>%
mutate(across(3:ncol(.), ~ as.integer(.x))) %>%
column_to_rownames("gene_id")
gene.reference <- dplyr::select(count.matrix, 1)
count.matrix <- count.matrix %>%
dplyr::select(-1)
Information entered as metadata describes the samples (columns) of the count matrix. This metadata is combined with the sample/column names from the count matrix so the accuracy of the metadata can be reviewed. Take time to review the table below.
# Design ColData Matrix
sample <- colnames(count.matrix)
colData <- data.frame(sample)
# Add Columns for Condition to colData
for (i in colnames(metadata)){
colData[i] <- metadata[i]
}
colData <- column_to_rownames(colData, "sample")
## replicate strain
## hrde1_rep1 "rep1" "hrde1"
## hrde1_rep2 "rep2" "hrde1"
## hrde1_rep3 "rep3" "hrde1"
## hrde2_rep1 "rep1" "hrde2"
## hrde2_rep2 "rep2" "hrde2"
## hrde2_rep3 "rep3" "hrde2"
## wt_rep1 "rep1" "wt"
## wt_rep2 "rep2" "wt"
## wt_rep3 "rep3" "wt"
Using the count matrix, column metadata, and user-inputed design formula (expressing the variables to be used in modeling), a DESeq data set is created.
The experimental variable and the associated control/experimental conditions are given by the user in “ref_level_cond” and “ref_level_value” in the sampleInfo file respectively.
Pre-filtering is performed, keeping only genes that have a count of at least 10 in a minimum number of samples. The minimum number of samples is decided by calculating the number of times the reference level value (control condition) is listed in the metadata with the assumption that experimental conditions will be repeated the same number of times.
Standard differential expression analysis is performed with the DESeq function, and regularized log transformation (rlog) transforms count data to a log2 scale for PCA.
# Set Reference Level
reflevelcond <- trimws(sample_params$ref_level_cond)
reflevelvalue <- trimws(sample_params$ref_level_value)
# Run DESeq2 analysis
deseq_result <- runDESeq2Analysis(
count.matrix = count.matrix,
colData = colData,
design = sample_params$design,
ref_level_cond = reflevelcond,
ref_level_value = reflevelvalue,
metadata = metadata
)
dds <- deseq_result$dds
rld <- deseq_result$rld
If input data is designated to be batch corrected, the ComBat_seq() function from the SVA Bioconductor package is used. The batch (condition) to be regressed out is given in sampleInfo.csv as “batch_cond”, and the sample information for this condition is taken from the metadata.
The group argument of ComBat_seq() specifies biological covariates, whos signals should be preserved in adjusted data. All conditions (columns) from the metadata which are not to be regressed out are passed to the group argument. Differential expression analysis is otherwise performed as described above on batch-corrected counts.
if (as.logical(trimws(sample_params$batch_correct))) {
# Replicate / Batch Correction
batch <- trimws(sample_params$batch_cond)
batch.correct <- ComBat_seq(
counts = as.matrix(count.matrix),
batch = metadata[[batch]],
group = metadata[[colnames(metadata)[colnames(metadata) != batch]]]) %>%
as.data.frame()
# Run DESeq2 analysis on batch-corrected counts
batch_deseq_result <- runDESeq2Analysis(
count.matrix = batch.correct,
colData = colData,
design = sample_params$design,
ref_level_cond = reflevelcond,
ref_level_value = reflevelvalue,
metadata = metadata
)
batch.correct.dds <- batch_deseq_result$dds
batch.correct.rld <- batch_deseq_result$rld
# Cleanup
rm(batch, batch.correct, keep)
}
Reference level condition and value (ref_level_cond and ref_level_value respectively) inputs from sampleInfo.csv are used to create contrast terms to compare treatment samples with control samples. This relies on the design formula being formatted correctly to produce comparisons relevant to the experimental condition.
# # Find relevant contrasts
contrasts <- list()
# Factor values of experimental condition
pattern <- factor(metadata[[reflevelcond]])
comp <- levels(pattern)
# Remove control state from experimental condition vector
comp <- comp[comp != reflevelvalue]
# Create Contrasts
for (i in 1:length(comp)){
contrasts[[i]] <- c(reflevelcond, comp[i], reflevelvalue)
}
Based on the number of experimental conditions (identified in the ref_level_cond column of the metadata) and the contrast terms comparing them to the experimental control (ref_level_value), differential expression results are extracted from the DESeq data set object.
# Find results from relevant contrasts
results <- list()
# Pull results from DESeq object with contrasts
for (i in 1:length(contrasts)){
results[[i]] <- list()
results[[i]][["DataFrame"]] <- results(dds.for.analysis, contrast = contrasts[[i]]) %>%
data.frame() %>%
rownames_to_column("gene_id") %>%
mutate(gene_name = gene.reference[gene_id, 1], .before=baseMean) %>% # Add gene names to DESeq results matrices
column_to_rownames("gene_id")
results[[i]][["DESeqDataSet"]] <- results(dds.for.analysis, contrast = contrasts[[i]])
}
The Bioconductor package PCAtools is used to perform principle component analysis on regularized log transformed count data. Scree plots show principle component numbers on the x-axis, and their respective eigenvalues on the y-axis.
The third graph plots experimental metadata against a number of PCs and their gene loadings to visualize the agreement with the gene expression pattern of each condition for each PC.
Transfering WORMBASE gene IDs to ENTREZID:
308 gene IDs were not transfered from WORMBASE to ENTREZID
12108 gene IDs were successfully transfered from WORMBASE to ENTREZID
Number of highly variable genes identified (with | L2FC | > 0.8 and padj < 0.1) is:
495
Number of highly upregulated genes identified (with L2FC > 0.8 and padj < 0.1) is:
377
Number of highly downregulated genes identified (with L2FC < -0.8 and padj < 0.1) is:
118
Table Reactome Pathway Over-representation Analysis of Downregulated Genes has no rows.
Transfering WORMBASE gene IDs to ENTREZID:
308 gene IDs were not transfered from WORMBASE to ENTREZID
12108 gene IDs were successfully transfered from WORMBASE to ENTREZID
Number of highly variable genes identified (with | L2FC | > 0.8 and padj < 0.1) is:
1557
Number of highly upregulated genes identified (with L2FC > 0.8 and padj < 0.1) is:
1051
Number of highly downregulated genes identified (with L2FC < -0.8 and padj < 0.1) is:
506
Table Reactome Pathway GSE Of All Differentially Expressed Genes has no rows.
This section provides complete information about the R environment, package versions, and system configuration used to generate this report. This information is critical for reproducing the analysis.
R Version and Platform:
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS 15.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] here_1.0.2 RColorBrewer_1.1-3 lubridate_1.9.4 forcats_1.0.0
[5] purrr_1.1.0 tidyr_1.3.1 tidyverse_2.0.0 patchwork_1.3.2
[9] rrvgo_1.18.0 EnhancedVolcano_1.24.0 org.Ce.eg.db_3.20.0 AnnotationDbi_1.68.0
[13] vsn_3.74.0 sva_3.54.0 BiocParallel_1.40.2 genefilter_1.88.0
[17] mgcv_1.9-3 nlme_3.1-168 GOSemSim_2.32.0 DESeq2_1.46.0
[21] SummarizedExperiment_1.36.0 Biobase_2.66.0 MatrixGenerics_1.18.1 matrixStats_1.5.0
[25] GenomicRanges_1.58.0 GenomeInfoDb_1.42.3 IRanges_2.40.1 S4Vectors_0.44.0
[29] BiocGenerics_0.52.0 clusterProfiler_4.14.6 ReactomePA_1.50.0 readr_2.1.5
[33] tibble_3.3.0 stringr_1.5.2 knitrBootstrap_1.0.3 PCAtools_2.18.0
[37] ggrepel_0.9.6 ggplot2_4.0.0 htmltools_0.5.8.1 ashr_2.2-63
[41] kableExtra_1.4.0 knitr_1.50 DT_0.34.0 dplyr_1.1.4
[45] bookdown_0.44
loaded via a namespace (and not attached):
[1] fs_1.6.6 enrichplot_1.26.6 httr_1.4.7 tools_4.4.0
[5] R6_2.6.1 lazyeval_0.2.2 withr_3.0.2 graphite_1.52.0
[9] gridExtra_2.3 preprocessCore_1.68.0 cli_3.6.5 textshaping_1.0.3
[13] labeling_0.4.3 slam_0.1-55 sass_0.4.10 SQUAREM_2021.1
[17] S7_0.2.0 tm_0.7-16 askpass_1.2.1 mixsqp_0.3-54
[21] systemfonts_1.2.3 yulab.utils_0.2.1 gson_0.1.0 DOSE_4.0.1
[25] svglite_2.2.1 R.utils_2.13.0 invgamma_1.2 limma_3.62.2
[29] rstudioapi_0.17.1 RSQLite_2.4.3 treemap_2.4-4 generics_0.1.4
[33] gridGraphics_0.5-1 crosstalk_1.2.2 vroom_1.6.5 GO.db_3.20.0
[37] Matrix_1.7-4 abind_1.4-8 R.methodsS3_1.8.2 lifecycle_1.0.4
[41] yaml_2.3.10 edgeR_4.4.2 qvalue_2.38.0 SparseArray_1.6.2
[45] grid_4.4.0 blob_1.2.4 promises_1.3.3 dqrng_0.4.1
[49] crayon_1.5.3 ggtangle_0.0.7 lattice_0.22-7 beachmat_2.22.0
[53] cowplot_1.2.0 annotate_1.84.0 KEGGREST_1.46.0 pillar_1.11.0
[57] fgsea_1.32.4 codetools_0.2-20 fastmatch_1.1-6 glue_1.8.0
[61] ggfun_0.2.0 data.table_1.17.8 vctrs_0.6.5 png_0.1-8
[65] treeio_1.30.0 gtable_0.3.6 cachem_1.1.0 xfun_0.53
[69] S4Arrays_1.6.0 mime_0.13 tidygraph_1.3.1 survival_3.8-3
[73] pheatmap_1.0.13 statmod_1.5.1 ggtree_3.14.0 bit64_4.6.0-1
[77] rprojroot_2.1.1 bslib_0.9.0 affyio_1.76.0 irlba_2.3.5.1
[81] colorspace_2.1-2 DBI_1.2.3 tidyselect_1.2.1 bit_4.6.0
[85] compiler_4.4.0 graph_1.84.1 xml2_1.4.0 NLP_0.3-2
[89] DelayedArray_0.32.0 scales_1.4.0 affy_1.84.0 rappdirs_0.3.3
[93] digest_0.6.37 rmarkdown_2.29 XVector_0.46.0 pkgconfig_2.0.3
[97] umap_0.2.10.0 sparseMatrixStats_1.18.0 fastmap_1.2.0 rlang_1.1.6
[101] htmlwidgets_1.6.4 UCSC.utils_1.2.0 shiny_1.11.1 DelayedMatrixStats_1.28.1
[105] farver_2.1.2 jquerylib_0.1.4 jsonlite_2.0.0 R.oo_1.27.1
[109] BiocSingular_1.22.0 magrittr_2.0.4 GenomeInfoDbData_1.2.13 ggplotify_0.1.2
[113] wordcloud_2.6 Rcpp_1.1.0 ape_5.8-1 viridis_0.6.5
[117] reticulate_1.43.0 stringi_1.8.7 ggraph_2.2.2 zlibbioc_1.52.0
[121] MASS_7.3-65 plyr_1.8.9 parallel_4.4.0 Biostrings_2.74.1
[125] graphlayouts_1.2.2 splines_4.4.0 hms_1.1.3 locfit_1.5-9.12
[129] igraph_2.1.4 markdown_2.0 reshape2_1.4.4 ScaledMatrix_1.14.0
[133] XML_3.99-0.19 evaluate_1.0.5 BiocManager_1.30.26 tzdb_0.5.0
[137] tweenr_2.0.3 httpuv_1.6.16 openssl_2.3.3 polyclip_1.10-7
[141] gridBase_0.4-7 ggforce_0.5.0 rsvd_1.0.5 xtable_1.8-4
[145] reactome.db_1.89.0 RSpectra_0.16-2 tidytree_0.4.6 later_1.4.4
[149] ragg_1.5.0 viridisLite_0.4.2 truncnorm_1.0-9 aplot_0.2.9
[153] memoise_2.0.1 timechange_0.3.0
Package Versions:
Package Version
bookdown "bookdown" "0.44"
clusterProfiler "clusterProfiler" "4.14.6"
DESeq2 "DESeq2" "1.46.0"
dplyr "dplyr" "1.1.4"
EnhancedVolcano "EnhancedVolcano" "1.24.0"
ggplot2 "ggplot2" "4.0.0"
org.Ce.eg.db "org.Ce.eg.db" "3.20.0"
PCAtools "PCAtools" "2.18.0"
ReactomePA "ReactomePA" "1.50.0"
Analysis Parameters:
Report generated: 2025-11-26 15:09:25.570865
Working directory: /Users/masonmatich/Documents/research_labs/projects/bulk_RNA_seq_workflow